In this tutorial, you can run AtacAnnoR step by step to see how AtacAnnoR works and modify parameters to annotate scATAC-seq cells better.


Load required packages and datasets

The 10X-Multiome dataset is available from 10x genomics. They have been already processed as Seurat Objects.

library(AtacAnnoR)
library(Seurat)
## Attaching SeuratObject
setwd('~/txmdata/scATAC_seq/AtacAnnoR')
SeuratObj_RNA <- readRDS('./data/10X-Multiome/SeuratObj/SeuratObj_RNA.RDS')    # reference
SeuratObj_ATAC <- readRDS('./data/10X-Multiome/SeuratObj/SeuratObj_ATAC.RDS')    # query

Preprocessing reference and query data

Preprocessing of scATAC-seq data in AtacAnnoR includes two steps:

Gene-level signal extraction: gene activity matrix

For this dataset, gene activity was previously calculated by Signac::Geneactivity() with the assistance of fragments file, and saved in ‘ACTIVITY’ assay of the Seurat object SeuratObj_ATAC. ArchR can also be used to calculate the gene activity matrix using the fragments file.

If the fragment file is not available, we recommend using the function get_ga_from_peak_mtx() in AtacAnnoR to calculate the gene activity matrix. This function can run independently of the fragments file and achieve similar performance to the GeneActivity() function in Signac.

# hg38_gene_gr <- get_gene_gr(genome = 'hg38')
# gene_activity_mtx <- get_ga_from_peak_mtx(peak_counts = SeuratObj_ATAC[['ATAC']]@counts, gene_gr = hg38_gene_gr)

Genome-wide-level signal extraction: meta-program matrix

Meta-program matrix can be calculated using function get_nmf_embedding() in AtacAnnoR, and the default factor number is set to 50.

query_nmf_embedding <- get_nmf_embedding(peak_counts = SeuratObj_ATAC[['ATAC']]@counts,n_factors = 50)
## Performing TF-IDF normalization...
## Performing NMF...

We next keep the common features of reference and query.

pre_processing_mtxs <- pre_processing(ref_mtx = SeuratObj_RNA[['RNA']]@counts,
                                      query_mtx = SeuratObj_ATAC[['ACTIVITY']]@counts,
                                      verbose = T)
## Keep 16523 intersect genes
ref_mtx <- pre_processing_mtxs$ref_mtx
query_mtx <- pre_processing_mtxs$query_mtx

First round annotation

In the first round annotation, AtacAnnoR performs a gene-level annotation and select a fraction of cells as seed cell candidates.

Step 1: Identification of reference cell type-specific marker genes

In addition to the conventional marker genes (referred to as global marker genes here), we also identify the neighbor marker genes for a reference cell type that distinguish it from closely related adjacent cell types.

global_markers <- get_global_markers_sc(sc_counts_mtx = ref_mtx,
                                        labels = SeuratObj_RNA$true,
                                        max_marker = 200)
neighbor_celltypes <- get_neighbor_celltypes(sc_count_mtx = ref_mtx,
                                             labels = SeuratObj_RNA$true,
                                             global_markers,
                                             min.cor = 0.6)

neighbor_markers <- get_neighbor_markers_sc(sc_counts_mtx = ref_mtx,
                                            labels = SeuratObj_RNA$true,
                                            neighbor_celltypes = neighbor_celltypes,
                                            global_markers = global_markers)

We can assess the quality of the global marker genes by function plot_global_markers_heatmap()

plot_ref_global_markers_heatmap(ref_mtx = ref_mtx,
                                ref_labels = SeuratObj_RNA$true,
                                global_markers = global_markers,
                                neighbor_celltypes = neighbor_celltypes)

As shown in the heatmap, neighbor markers sometimes cannot distinguish cell subtypes, such as the B cell subtypes “Naive B”, “Intermediate B” and “Memory B”. So function plot_neighbor_markers_heatmap() helps to evaluate the efficacy of these markers in distinguishing neighboring cell types.

plot_ref_neighbor_markers_heatmap(ref_mtx = ref_mtx,
                                  ref_labels = SeuratObj_RNA$true,
                                  neighbor_markers = neighbor_markers,
                                  neighbor_celltypes = neighbor_celltypes,
                                  celltype_to_plot = 'Memory B')

Step 2: Determination of candidate cell type labels for each query cell

This is done by calculating the Kendall’s tau between each query cell’s gene activity profile and each reference cell type’s gene expression profile, and the candidate cell label is determined by selecting the most similar reference cell type. The predicted labels are saved in cell_meta$kendall_pred.

cor_mtx <- get_cor_mtx(sc_count_mtx = ref_mtx,
                       labels = SeuratObj_RNA$true,
                       query_mtx = query_mtx,
                       global_markers = global_markers,
                       query_nmf_embedding = query_nmf_embedding)
## Processing reference...
## Processing query...
## Intersect genes number: 756
## Computing Kendall correlation coefficient using 10 threads...
cell_meta <- get_kendall_pred(cor_mtx)

Step 3: Validation of candidate cell type labels

This step is done by testing whether the marker genes are of higher activity than the background genes using statistical test, which result in two scores: global marker significant score(GMSS) and neighbor marker significant score(NMSS). Then, the seed cell candidates are selected by choosing those cells with high scores using Gaussian Mixture Model (GMM).

cell_meta <- test_markers(query_mtx,cell_meta,global_markers,neighbor_markers,threads = 10,verbose = T)
## Start testing markers using 10 threads...
cell_meta <- get_seed_candidates(cell_meta)

At the end of the first round annotation, seed cell candidates are selected according to the gene-level information of ATAC cells, we can visualize the distribution of these cells on UMAP by function plot_pred_umap() or check their NMF meta-program profile quality by function plot_pred_nmf().

plot_pred_umap(Seurat.object = SeuratObj_ATAC,
               cell_meta = cell_meta,
               category = 'seed_candidate')
## Loading required package: Signac

plot_pred_nmf(query_nmf_embedding = query_nmf_embedding,
              cell_meta = cell_meta,
              neighbor_celltypes = neighbor_celltypes,
              category = 'seed_candidate')


Second round annotation

In the second round annotation, AtacAnnoR performs a genome-wide-level annotation of all query scATAC-seq cells using the seed cells from query itself to avoid batch effect.

Step 1: Seed cell cleaning

Seed cell cleaning is done by applying WKNN algorithm to the seed cell candidates themselves, and discarding those with inconsistent labels.

cell_meta <- seed_cleaning(cell_meta,query_nmf_embedding)

Also, the UMAP distribution and the NMF meta-program heatmap of the seed cells can be visualized by function plot_pred_umap() and plot_pred_nmf().

plot_pred_umap(Seurat.object = SeuratObj_ATAC,
               cell_meta = cell_meta,
               category = 'seed')

plot_pred_nmf(query_nmf_embedding = query_nmf_embedding,
              cell_meta = cell_meta,
              neighbor_celltypes = neighbor_celltypes,
              category = 'seed')

We can also check the heatmap illustrating the activity of the corresponding reference cell type-specific global/neighbor marker genes in query seed cells by function plot_seed_global_markers_heatmap() and plot_seed_neighbor_markers_heatmap() respectively.

plot_seed_global_markers_heatmap(query_mtx = query_mtx,
                                 cell_meta = cell_meta,
                                 global_markers = global_markers,
                                 neighbor_celltypes = neighbor_celltypes)

plot_seed_neighbor_markers_heatmap(query_mtx = query_mtx,
                                   cell_meta = cell_meta,
                                   neighbor_markers = neighbor_markers,
                                   neighbor_celltypes = neighbor_celltypes,
                                   celltype_to_plot = 'Memory B')

Step 2: Final prediction

Final prediction is done by applying WKNN algorithm again to predicted the left unlabeled cells using seed cells’ labels.

cell_meta <- WKNN_predict(cell_meta,query_nmf_embedding)

In the same way, the UMAP distribution and the NMF meta-program heatmap of the final predictions can be visualized by function plot_pred_umap() and plot_pred_nmf().

plot_pred_umap(Seurat.object = SeuratObj_ATAC,
               cell_meta = cell_meta,
               category = 'final')

plot_pred_nmf(query_nmf_embedding = query_nmf_embedding,
              cell_meta = cell_meta,
              neighbor_celltypes = neighbor_celltypes,
              category = 'final')

Finally, the cell type proportion of the reference, seed cell candidates, seed cells, and final predictions can be visualized by using function plot_celltype_proportions().

plot_celltype_proportions(cell_meta,ref_labels = SeuratObj_RNA$true)

Session Info

sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Signac_1.9.0       RcppML_0.5.6       SeuratObject_4.1.3 Seurat_4.3.0      
## [5] AtacAnnoR_0.3.0   
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.15        spam_2.9-1             fastmatch_1.1-3       
##   [4] plyr_1.8.8             igraph_1.4.2           lazyeval_0.2.2        
##   [7] sp_1.6-0               splines_4.2.3          BiocParallel_1.32.6   
##  [10] listenv_0.9.0          scattermore_0.8        GenomeInfoDb_1.35.16  
##  [13] ggplot2_3.4.2          digest_0.6.31          foreach_1.5.2         
##  [16] htmltools_0.5.5        fansi_1.0.4            magrittr_2.0.3        
##  [19] tensor_1.5             cluster_2.1.4          doParallel_1.0.17     
##  [22] ROCR_1.0-11            Biostrings_2.66.0      ComplexHeatmap_2.14.0 
##  [25] globals_0.16.2         matrixStats_0.63.0     spatstat.sparse_3.0-1 
##  [28] colorspace_2.1-0       ggrepel_0.9.3          xfun_0.38             
##  [31] dplyr_1.1.1            RCurl_1.98-1.12        crayon_1.5.2          
##  [34] jsonlite_1.8.4         progressr_0.13.0       spatstat.data_3.0-1   
##  [37] survival_3.5-5         zoo_1.8-11             iterators_1.0.14      
##  [40] glue_1.6.2             polyclip_1.10-4        gtable_0.3.3          
##  [43] zlibbioc_1.44.0        XVector_0.38.0         leiden_0.4.3          
##  [46] GetoptLong_1.0.5       shape_1.4.6            future.apply_1.10.0   
##  [49] BiocGenerics_0.44.0    abind_1.4-5            scales_1.2.1          
##  [52] mvtnorm_1.1-3          spatstat.random_3.1-4  miniUI_0.1.1.1        
##  [55] Rcpp_1.0.10            viridisLite_0.4.1      xtable_1.8-4          
##  [58] clue_0.3-63            reticulate_1.28        dotCall64_1.0-2       
##  [61] stats4_4.2.3           htmlwidgets_1.6.2      httr_1.4.5            
##  [64] RColorBrewer_1.1-3     ellipsis_0.3.2         ica_1.0-3             
##  [67] pkgconfig_2.0.3        farver_2.1.1           sass_0.4.5            
##  [70] uwot_0.1.14            deldir_1.0-6           utf8_1.2.3            
##  [73] tidyselect_1.2.0       labeling_0.4.2         rlang_1.1.0           
##  [76] reshape2_1.4.4         later_1.3.0            munsell_0.5.0         
##  [79] pbmcapply_1.5.1        tools_4.2.3            cachem_1.0.7          
##  [82] cli_3.6.1              generics_0.1.3         ggridges_0.5.4        
##  [85] evaluate_0.20          stringr_1.5.0          fastmap_1.1.1         
##  [88] yaml_2.3.7             goftest_1.2-3          knitr_1.42            
##  [91] fitdistrplus_1.1-8     purrr_1.0.1            RANN_2.6.1            
##  [94] pbapply_1.7-0          future_1.32.0          nlme_3.1-162          
##  [97] mime_0.12              RcppRoll_0.3.0         compiler_4.2.3        
## [100] rstudioapi_0.14        plotly_4.10.1          png_0.1-8             
## [103] spatstat.utils_3.0-2   tibble_3.2.1           bslib_0.4.2           
## [106] pcaPP_2.0-3            stringi_1.7.12         highr_0.10            
## [109] lattice_0.20-45        Matrix_1.5-1           vctrs_0.6.1           
## [112] pillar_1.9.0           lifecycle_1.0.3        spatstat.geom_3.1-0   
## [115] lmtest_0.9-40          jquerylib_0.1.4        GlobalOptions_0.1.2   
## [118] RcppAnnoy_0.0.20       bitops_1.0-7           data.table_1.14.8     
## [121] cowplot_1.1.1          irlba_2.3.5.1          GenomicRanges_1.50.2  
## [124] httpuv_1.6.9           patchwork_1.1.2        R6_2.5.1              
## [127] promises_1.2.0.1       KernSmooth_2.23-20     gridExtra_2.3         
## [130] IRanges_2.32.0         parallelly_1.35.0      codetools_0.2-19      
## [133] MASS_7.3-58.3          rjson_0.2.21           withr_2.5.0           
## [136] Rsamtools_2.14.0       sctransform_0.3.5      GenomeInfoDbData_1.2.9
## [139] S4Vectors_0.36.2       parallel_4.2.3         grid_4.2.3            
## [142] tidyr_1.3.0            rmarkdown_2.21         Cairo_1.6-0           
## [145] Rtsne_0.16             spatstat.explore_3.1-0 shiny_1.7.4